DOCUMENT RESUME

ED 183 589                                              TM 010 108

AUTHOR        Tomko, Thomas N.; Ennis, Robert H.
TITLE         Evaluation of Informal Logic Competence. Rational
              Thinking Reports Number 3.
INSTITUTION   Illinois Univ., Urbana. Bureau of Educational
              Research.
PUB DATE      79
EDRS PRICE    MF01/PC0? Plus Postage.
DESCRIPTORS   Cognitive Tests; Criterion Referenced Tests;
              *Critical Thinking; Evaluation Methods; *Logical
              Thinking; Norm Referenced Tests; State of the Art
              Reviews; *Student Testing; *Test Construction; Test
              Reliability; *Test Reviews; *Test Selection; Test
              Validity
IDENTIFIERS   Cornell Critical Thinking Test; Watson Glaser
              Critical Thinking Appraisal

ABSTRACT
     A discussion of evaluating informal logic competence
centers on the identification of currently available tests and on a
description of test theory, to aid in test selection. The following
concepts in test theory are discussed: norm-referenced and
criterion-referenced tests, true scores, reliability, validity, test
selection and evaluation, test construction, and test format. The use
of tests to assess student performance and in various experimental
designs is also explained, as well as responsive evaluation,
semi-structured evaluation, surveys and questionnaires, and
longitudinal follow-up studies. (MH)

* Reproductions supplied by EDRS are the best that can be made *
* from the original document.                                  *




EVALUATION OF INFORMAL LOGIC COMPETENCE

BY

THOMAS N. TOMKO
AND
ROBERT H. ENNIS

RATIONAL THINKING REPORTS

NUMBER 3






General Editor of Rational Thinking Reports
Robert H. Ennis



Published by

The Illinois Rational Thinking Project
Bureau of Educational Research
University of Illinois at Urbana-Champaign
Urbana, Illinois 
1979 



Evaluation of Informal Logic Competence¹

Thomas N. Tomko and Robert H. Ennis

Bureau of Educational Research

University of Illinois, Urbana-Champaign

One of the distinguishing features of the developing informal logic
movement is its concern with pedagogy. Informal logic teachers want to know
what skills and concepts are important for their students to know, how those
skills and concepts can best be taught, and how one can determine if they
have been successfully taught. Certainly the answers to the latter two
questions will require some empirical research. In this paper we will examine
tests and evaluation procedures in the hope that we can shed some light on
these two questions. Our intended audience is people who have the above
concerns, but who are unfamiliar with available tests and/or with the field
of testing.

Tests 

One naturally turns to tests as a starting point for the evaluation of
teaching. Although not all people agree on the value of tests, they are a
very practical and widely-used evaluation tool.

Currently available tests

The Watson-Glaser Critical Thinking Appraisal. This test is perhaps the
most widely-used instrument in the area of logic and critical thinking. It
has two forms, Ym and Zm (revised in 1964), each 100 items long and divided
into five subtests:

Inference - ability to discriminate among degrees of truth, falsity,
or probability of inferences drawn from given facts or data

Recognition of Assumptions - ability to recognize unstated assump-
tions in given assertions or propositions

Deduction - ability to reason from given premises; recognition of
logical implication

Interpretation - ability to weigh evidence and to discriminate
among degrees of probable inference

Evaluation of Arguments - ability to discriminate between strong
and weak, important and irrelevant arguments

¹We thank Robert Linn, Stephen Norris, Denis Phillips and Bruce
Stewart for their comments on and criticism of a draft of this paper.

The type of answer to be selected varies with the subtests. For example,
in Test 1 (Inference) students must decide, after reading a paragraph,
whether some proposed inferences are true, probably true, probably false,
false, or cannot be judged due to insufficient data. In Test 2 (Recognition
of Assumptions), the examinee must decide, given a statement, whether or not
another statement is "presupposed or taken for granted". For example, given
the statement "Let us immediately build superior armed force and thus keep
peace and prosperity," one must decide whether the proposed assumption "The
building of superior armed force guarantees the maintenance of peace and
prosperity" is "necessarily" made or not made. In Test 3 (Deduction) and
Test 4 (Interpretation), one must decide whether a conclusion does or does
not follow. In Test 5 (Evaluation of Arguments), one must decide whether
short, two- or three-sentence arguments are strong or weak. The questions in
each test are preceded by sample items and an explanation of what the student
is to do.

The test does not cover some aspects of critical thinking which one
might wish to cover. For example, no semantical skills or concepts are
covered, so there are no questions dealing with definition or ambiguity.
There are no questions dealing with the reliability of observation state-
ments or statements made by authorities.

To the extent that the Watson-Glaser test measures the ability to deal
with induction, it has problems that also trouble the Cornell Critical
Thinking Tests (CCTT), discussed below. In order to answer some of the
items correctly, examinees must have certain background knowledge. If they
do not, they may answer incorrectly even though the way they reached the
answer might be judged to be acceptable. This problem appears in just about
any induction test. We have found that it is very difficult to construct
test items on inductive reasoning that have one answer that is clearly the
best. One can almost always defend alternative answers by making certain
reasonable assumptions that were not foreseen by the test maker.

Another problem with the Watson-Glaser test is its concept of an
assumption. The directions to the section, "Recognition of Assumptions",
say, "If you think the assumption is not necessarily taken for granted in
the statement, make a heavy line under 'ASSUMPTION NOT MADE' on the Answer
Sheet." (p. 4) We can probably neglect the problem of referring to the
proposed assumption as an "assumption", but there is a more serious problem.
The wording encourages the test taker to look for something that is necessarily
taken for granted. Possibly presuppositions (of the type discussed by
Strawson, 1952) are "necessarily" taken for granted, but premise-type assump-
tions are not. There is always another and different premise or premises
to fill a real gap in an argument. So no premise-type assumption is neces-
sarily taken for granted (Ennis, 1961, elaborates this point).

A third problem is that one's answers to the items in Test 5, Eval-
uation of Arguments, often depend on one's politics and values. It does
not seem fair to mark a person wrong because of the person's politics and
values. For example, Items 89 and 91 are keyed weak arguments on the issue
of whether the United States government should try to inform the public of
sought-after results of its scientific research programs in the areas of
"new weapons, equipment, devices, etc.". These items are:

89. No; some people become critical of the government when widely
publicized projects turn out unsuccessfully.

91. Yes; the projects are supported by taxes and the general
public would like to know how their money is to be spent.


We agree with the key on Item 89 and disagree on Item 91, but we can see
how people with value positions different from ours might well answer just
the opposite. We are not here urging a relativistic ethics, but do not
think that a person's critical-thinking score should depend on that person's
politics and values in such cases.

A fourth problem for some is the orientation of the test to the United
States, exemplified in the above-mentioned issue in Test 5. Students from
other English-speaking countries might be penalized for being less familiar
with United States government structure, policies, and problems.

The Cornell Critical Thinking Tests. These are actually two different
tests, Level X and Level Z (Ennis & Millman, 1971a, b, c). They are not
parallel forms of the same test, although they are both general measures of
critical thinking ability. Level X is meant to be used for testing students
roughly from junior high school through first year college age. Level Z is
intended for testing examinees of college age on up, although high ability
secondary students could also take the test.

The rationale for the test is based upon a concept of critical thinking
set forth in Ennis (1962). According to the test manual,

A critical thinker is characterized by proficiency in judging
whether:

1. A statement follows from the premises.
2. Something is an assumption.
3. An observation statement is reliable.
4. An alleged authority is reliable.
5. A simple generalization is warranted.
6. A hypothesis is warranted.
7. A theory is warranted.
8. An argument depends upon an ambiguity.
9. A statement is overvague or overspecific.
10. A reason is relevant. (Ennis & Millman, 1971a)




Not all of these aspects of critical thinking are covered by each test.
Level X does not cover aspects 7, 8, and 9, while Level Z does not cover
aspect 7.

In the Level X test, examinees read a story about space explorers on
a distant planet and are asked questions as the story unfolds. For example,
at one point the explorers are said to be watching a group of beings who
are standing around a campfire. In one type of question, the examinees
must decide whether the first of the following underlined statements is
more reliable than the second, the second more reliable than the first, or
whether neither is more reliable than the other.

A. The mechanic, looking through his field glasses, says, "They
are tan-skinned creatures with furry spots."

B. The anthropologist, looking through his field glasses, says,
"They don't have furry spots. They are wearing the skins
of animals."

C. Equally reliable or unreliable.

The Level Z test is somewhat more difficult and the questions and
directions are more complex. For example, after an experiment and its
results are described, examinees are asked whether certain additional infor-
mation, if true, would make the results a) more certain, b) less certain,
or c) neither. There is also a section on semantic skills and concepts not
covered on the Level X test.

Like the Watson-Glaser test, the Level X test has its problems with
induction. It tries to avoid some of these problems by asking whether
proposed evidence counts for, against, or is neutral with respect to a
certain hypothesis; that is, it only asks in which direction the evidence
points. But as in the case of the Watson-Glaser test, one can, by making
reasonable assumptions, dispute the keyed answers. We feel that this is
the major problem that people writing tests of inductive-reasoning ability
must overcome.

Some of the subsections of the Level Z test seem a little short. For
example, the section on judging the reliability of observations and authori-
ties is only 4 items long, as compared with 21 items on the Level X test.

The Watson-Glaser Critical Thinking Appraisal and the Cornell Critical
Thinking Tests, Level X and Level Z, are, in our opinion, the only general
tests of critical thinking currently available. But there are two other
available tests that claim to measure or appear to attempt to measure general
critical thinking ability, and there is a set of "indexes" that might jointly
be construed as a critical thinking test.

One test is The Uncritical Inference Test by William V. Haney (1975). It
consists of three stories, each followed by a series of statements (76 items
altogether) which the examinee must decide are either "definitely true",
"definitely false", or "?" (questionable) based on the information given
in the story. Due to the large number of answers keyed "?", a hardened
skeptic or "pathological doubter" would receive a high score on this test.
(This was a problem with an earlier version of the Watson-Glaser test, also.
See Ennis, 1958.) The test appears to be essentially a test of whether a
person is willing to infer at all beyond what is very explicitly stated, and
might better be called a critical uninference test.

The other is entitled Logical Reasoning, developed by J. P. Guilford
and A. F. Hertzka (1935), designed to assess what Guilford calls the factor
of evaluation of semantic implications. Although he claims that this factor
is known as critical thinking ability, the test itself consists of forty
items, each of which presents two premises of a syllogism and asks the examinee
to pick the correct conclusion from four possible alternatives. This seems
to be a test of only one aspect of critical thinking, viz., class reasoning.

Then there are several aspect-specific tests of critical thinking
ability. The Cornell Conditional Reasoning Test (Ennis, Gardiner, Guzzetta,
Morrow, Paulus, & Ringel, 1964) and the Cornell Class Reasoning Test (Ennis,
Gardiner, Morrow, Paulus, & Ringel, 1964) are designed to test the abilities
named in their titles. The Evaluation Aptitude Test, by D. E. Sell (1952),
consists of 36 syllogisms. The test is divided into two parts, one containing
neutral items and the other containing emotionally-loaded items. The test
is meant to be used, in part, to measure the degree to which emotional bias
influences deductive-reasoning ability.

A set of aspect-specific tests has been constructed by the Instructional
Objectives Exchange (IOX), formerly associated with the Center for the Study
of Evaluation at UCLA. It is an organization that serves as a clearinghouse
for instructional objectives and tests designed to measure whether or not
students have attained those objectives. Each test (or "index", as they
put it) is meant to measure the attainment of one or more stated objectives.

In Judgment: Deductive Logic and Assumption Finding (Instructional
Objectives Exchange, 1971), seven objectives from the area of informal logic
are presented in conjunction with five indexes designed to measure the
attainment of those objectives. The Conditional Reasoning Index and the
Class Reasoning Index are meant to measure two objectives each. The work
of Ennis and Paulus (1965) is cited as a basis for the items included, but one
should attend to the sense in which the IOX authors use the term 'valid'. The
directions ask examinees whether a proposed conclusion is valid or invalid,
given one or more premises, rather than whether an argument, or line of
reasoning, is valid.

Two other objectives and their corresponding indexes deal with assumption-
finding. However, the term 'assumption' has several senses and this could
cause some problems if one is not alert. Objective 5 in IOX (1971, p. 4)
states the following:

Given a series of statements, each of which is followed by several
proposed assumptions, the students will determine whether, within
each question set, each of the assumptions listed is necessary to
the particular statement.

Assumption Recognition Index I is the test associated with this objective.
Objective 6, on the other hand, involves judging whether statements are
necessary to arguments, a skill sometimes called "gap-filling". Assumption
Recognition Index II is the test associated with objective 6.

After being presented with a statement in the first test and an argument
in the second test, the examinee is asked to decide whether a person who offers
an argument or statement must accept each of a series of additional statements
in order to be "reasonable and consistent". In Assumption Recognition Index I,
the keyed answers appear to be either (Strawsonian) presuppositions or (Gricean)
implicatures, and in Assumption Recognition Index II, the keyed answers
are either gap-fillers or implicatures. While one might argue that pre-
suppositions are, in some sense, necessary for certain statements and that
certain gap-fillers are needed (though not logically necessary) for some argu-
ments (given a context), it is not clear that implicatures are necessary
for either statements or arguments. In addition, using the label "assumption
recognition" for tests which involve identifying presuppositions, impli-
catures, and gap-fillers may be misleading to the test user searching for a
test to cover a homogeneous notion of assumption recognition.

The final objective and its accompanying test, Recognizing Reliable
Observations, are also based on the work of Ennis (1962). The items in this
test seem clear and well-written, but given the number of criteria listed
by Ennis for judging the reliability of observation statements, there is
some question as to whether the test is long enough (10 items). This test
and the other IOX tests could be useful aids for the informal logic teacher,
but, as is the case with any test that one has not constructed oneself, they
should be examined carefully before they are used. There is no discussion
of cut-off scores, nor any argument offered for the representativeness of
the items.

The tests mentioned above are the ones we have found that are still
in print.* They are contained in a critical thinking test file we are
developing at the Illinois Rational Thinking Project, for which we hope
to obtain copies of all existing general and aspect-specific critical
thinking tests. One Project member, Bruce Stewart, has written a collection
of reviews of tests in the areas of critical thinking and informal logic
(Stewart, 1979), containing reviews of about thirty tests.

Another source for information about tests of all sorts is Buros'
Mental Measurements Yearbook (1972). Buros lists tests in many areas and
also includes reviews of some. The 1972 edition is now somewhat out of
date, but a new edition should appear soon.

*Perhaps Sell's Evaluation Aptitude Test is out of print now. It is not
listed in the most recent Psychometric Affiliates catalogue.




Concepts of Test Theory 

Although the selection of tests of interest to the teacher of informal
logic is unfortunately somewhat limited, it is useful to know what criteria
are generally used to choose a test. An understanding of these criteria
and an acquaintance with their theoretical background can help one get a
better overall picture of the process of evaluation of teaching and research
in informal logic.

Norm-referenced and criterion-referenced testing. A test that is
intended to measure the absolute standing of individuals with respect to
some standard performance (often mastery) is generally called by test
theorists a criterion-referenced or content-referenced test. ('Criterion-
referenced' is a term introduced by Glaser (Glaser & Klaus, 1962).) Its
inclusion of the word 'criterion' is somewhat unfortunate, since 'criterion'
has another use in test theory that could cause some confusion. (See the
discussion of predictive validity below.)

Someone might alternatively be interested, not in assessing degree of
achievement, but rather in determining differences among groups of students
and individual students. A researcher making comparisons would use a test
that assesses the relative standing of individuals with respect to the
possession of a trait or traits. This type of test is generally called by
test theorists a norm-referenced test. A person's score on such a test is
an indication of how well he or she performed as compared with other indi-
viduals who took the test. Someone who scores at the 90th percentile of
a norm-referenced test has done as well as or better than 90% of the people
with whom he or she is being compared. Such scores, however, do not indicate
whether the level of mastery was low or high. So, for example, someone who
scores at the 90th percentile on a test of reading ability may not be a very
good reader. He or she is just better than 90% of the group with whom he
or she is being compared (the norm group).
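
The contrast between the two readings of one raw score can be sketched in a
few lines of Python. This is only an illustration: the norm-group scores, the
40-point maximum, and the 80% mastery cutoff are invented for the example, and
the function names are hypothetical.

```python
def percentile_rank(score, norm_group):
    """Percent of the norm group scoring at or below `score`."""
    at_or_below = sum(1 for s in norm_group if s <= score)
    return 100.0 * at_or_below / len(norm_group)


def meets_criterion(score, max_score, cutoff_fraction):
    """Criterion-referenced reading: is the score at or above the cutoff?"""
    return score / max_score >= cutoff_fraction


# Hypothetical raw scores (out of 40) for a weak norm group.
norm_group = [8, 10, 11, 12, 12, 13, 14, 15, 16, 18]
score = 16

rank = percentile_rank(score, norm_group)     # norm-referenced reading
mastered = meets_criterion(score, 40, 0.80)   # criterion-referenced reading
print(rank, mastered)
```

With these invented numbers the same examinee stands at the 90th percentile
of the norm group yet falls well short of the assumed mastery cutoff, which is
the situation described above.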

Since it appears that the same test can be used both as a criterion-
referenced test and a norm-referenced test, we propose to change the labels
slightly and talk of criterion-referenced testing and norm-referenced testing.
This labeling explicitly recognizes the dependence of the distinction upon
purpose and interpretation in the given situation. This relabeling relieves
us of the burden of classifying every test as one or the other type, a task
we have found in practice to be impossible.

At the present time, the bulk of published tests are developed and
defended for the purpose of norm-referenced testing. Examples of such
tests include IQ tests and college entrance exams, such as the Scholastic
Aptitude Test (SAT) and the Graduate Record Examination (GRE). The
Watson-Glaser and Cornell Critical Thinking Tests might be usable for either
purpose, depending on whether they are judged by a test user to be adequate
in a particular situation. The IOX tests are designed to be used for
criterion-referenced testing. Competency testing, which is becoming popular
in elementary and secondary schools, is a type of criterion-referenced
testing. (For a discussion of problems with competency testing, in addition
to the general problems with criterion-referenced testing that we will
mention, see Smith, 1975.)

The fact that a test was designed for norm-referenced testing does not
preclude its use for criterion-referenced testing (or vice-versa). However,
some knowledge of the theories or models behind a test is helpful in deciding
whether a particular test is appropriate for a use one has in mind.






Most of the testing terminology that we shall introduce was originally
used in the context of classical mental test theory, which was developed to
cover norm-referenced testing. It should be noted, however, that norms are
not necessary for the employment of classical test theory (see Lord & Novick,
1968, p. 34, for the assumptions of classical test theory). The key concept
in classical test theory is variance, not norms. In fact, many of the terms
of classical test theory can be used when discussing criterion-referenced
testing, but it is not clear that all of the terms make sense when so employed.

True scores. Although there are now several types of mental-test score
models, classical test theory is what is known as a true-score model. On
this model, an individual's observed score on a test, X, consists of a true
score, T, plus an error score, E. A partly philosophical problem, the nature
of true scores, is discussed by test specialists Frederic Lord and Melvin
Novick (1968, pp. 39-44). Lord and Novick also discuss the basic assumptions
of classical and other test theories.
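
In symbols, the true-score decomposition just described can be written as
follows. This is an illustrative gloss rather than a quotation: the
zero-mean and zero-correlation conditions on the error score are the usual
classical assumptions, and the variance identity that follows from them is one
way to see why variance is the key concept of the theory.

```latex
X = T + E, \qquad \mathcal{E}(E) = 0, \qquad \rho_{TE} = 0
\quad\Longrightarrow\quad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2 .
```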

Reliability. The concepts of reliability and validity are very prominent
in the literature on testing. By definition a test is reliable to the extent
that it produces consistent results from one application to the next, and,
roughly speaking, a test is valid to the extent that it measures (or correctly
appraises) what it is supposed to measure (or correctly appraise). These are
very rough definitions, but they do give one an intuitive handle on the
concepts. Both concepts are problematic in application.

In contrast to discussion of all but two types of validity, discussions
of test reliability generally have at least the appearance of precision, as
a result of the complex statistical techniques that are employed to investigate
the reliability of tests. Despite their complexity, however, these techniques
are much easier to deal with than the controversial procedures of test vali-
dation. These facts may explain why there is an inordinate amount of
emphasis placed on establishing test reliability as opposed to showing the
validity of a test. In choosing tests, one is likely to encounter data on
test reliability rather frequently.

As defined above, a test is reliable to the extent that it produces
consistent results from one application to the next. This is similar to
what one would expect in a theory of measurement in the physical sciences.
A ruler is reliable to the extent that it produces a consistent set of
data from one application to the next. To determine whether a ruler was
reliable, one might repeatedly measure the same thing. A reliable instru-
ment would produce very close readings on the repeated measurements. For
some types of tests covered by educational test theory, such as physical
or motor skills tests, this notion of reliability will suffice. But for
most educational tests, reliability cannot be determined in terms of direct
comparison of repeated measurements of the same individual using one
instrument. This is partly because human beings are often changed by the
measurement process itself. The act of taking a test can affect the trait
being measured by the test. Consequently, one would not expect consistent
scores on repeated measures. In fact, one might find artificially consistent
scores, since examinees sometimes remember their original responses when
taking the retest.

In an attempt to surmount this problem, test theorists employ the notion
of a parallel test form. Parallel test forms are defined as tests that
produce parallel measurements. Two measurements are said to be parallel
measurements if each individual's true score on the two measurements is the
same and if the variances of the error scores for the two measures are equal
(that is, the error scores for the two measures are spread out to the same
extent). What this notion does for test theorists can be seen in the following
quote from a standard text in this area, Statistical Theories of Mental Test
Scores by Lord and Novick (1968, p. 48): "Thus parallel measurements measure
exactly the same thing in the same scale, and, in a sense, measure it equally
well for all persons."

The reason that parallel measures are used in discussions of reliability
is that one common method of defining reliability involves the unobservable
quantity, an individual's true score; that is, reliability is defined as the
squared correlation between observed score (X) and true score (T):
$\rho_{XT}^2$. It deductively follows from the assumptions of the theory,
however, that

     $\rho_{XT}^2 = \rho_{XX'}$,

where X and X' are parallel measures and are potentially observable.
So the concept of parallel forms is helpful in developing a usable theory of
reliability.
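
The relation between the squared observed-true correlation and the
correlation between parallel measures can be checked numerically. The sketch
below simulates two parallel forms under assumptions chosen purely for
illustration: normally distributed true scores with standard deviation 10 and
independent errors with standard deviation 5, so that both quantities should
come out near 100/125 = 0.8. The helper name `pearson` and all numbers are
hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)          # fixed seed so the run is repeatable
n = 20000
true = [rng.gauss(50, 10) for _ in range(n)]     # T (unobservable in practice)
form_a = [t + rng.gauss(0, 5) for t in true]     # X  = T + E
form_b = [t + rng.gauss(0, 5) for t in true]     # X' = T + E', same error spread

r_xt_squared = pearson(form_a, true) ** 2   # definition of reliability
r_xx = pearson(form_a, form_b)              # observable parallel-forms estimate
print(round(r_xt_squared, 2), round(r_xx, 2))
```

In a real testing situation only the second quantity could be computed, since
the true scores are never observed; the simulation has access to them only
because it generates them.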

It is very often judged too difficult, too time-consuming, or too
expensive to develop a parallel form for a test. When this is the case,
there is a third, widely-used approach to estimating the reliability of a
test: estimating the internal consistency of the test. These estimates use
only a single test form to estimate reliability. One such procedure is the
split-half method. In this method one first divides the test into two halves
that are assumed to be parallel. (There are, of course, many ways to split
a test. How the splitting is done depends upon the nature of the test.) The
scores on the two split-halves are then correlated. This does not produce
an estimate of the reliability of the original test, but of a test only half
as long. To estimate the reliability of the original test, one uses a
formula, the "Spearman-Brown" formula, that estimates the reliability of a
test that is longer than a given test.
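
The split-half procedure can be sketched as follows. The item responses are
invented (six examinees, six right/wrong items), the odd-even split is only
one of the many possible ways to divide a test, and the function names are
hypothetical. The Spearman-Brown step uses the standard form of that formula,
r_new = n r / (1 + (n - 1) r), with n = 2 for doubling the length.

```python
def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r, length_factor=2.0):
    """Projected reliability of a test `length_factor` times as long."""
    return length_factor * r / (1.0 + (length_factor - 1.0) * r)


def split_half_reliability(item_scores):
    """Odd-even split-half correlation and its Spearman-Brown correction."""
    first_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(first_half, second_half)
    return r_half, spearman_brown(r_half)


# Hypothetical responses: one row per examinee, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```

The corrected figure is always at least as large as the half-test correlation
when that correlation is positive, which reflects the assumption that longer
tests built from comparable items are more reliable.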

In addition to split-half internal consistency, there is another common
single-form approach to estimating the reliability of a test. This approach
we shall call the multiple internal-consistency approach, since it relies on
the extent to which all of the items intercorrelate with each other. The
Kuder-Richardson formulas (KR-20 and KR-21) are commonly used ways of esti-
mating multiple internal consistency.

There are other methods of estimating reliability by looking at the
internal consistency of a test, and one can find a discussion of these
approaches, as well as a cogent, but somewhat technical, discussion of the
concept of reliability, by Julian C. Stanley in a book chapter entitled
"Reliability" (1971).
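
For readers who have not met the Kuder-Richardson formulas, a minimal sketch
follows. It assumes right/wrong (0/1) items and uses the variance of total
scores computed with divisor N; some texts use N - 1 instead, which changes
the result slightly. The response matrix is invented and the function names
are hypothetical.

```python
def _total_stats(item_scores):
    """Number of items, mean and (divisor-N) variance of total scores."""
    n = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return k, mean_total, var_total


def kr20(item_scores):
    """KR-20: (k/(k-1)) * (1 - sum(p*q) / variance of total scores)."""
    n = len(item_scores)
    k, _, var_total = _total_stats(item_scores)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n   # proportion correct
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


def kr21(item_scores):
    """KR-21: like KR-20 but treats all items as equally difficult."""
    k, mean_total, var_total = _total_stats(item_scores)
    return (k / (k - 1)) * (1.0 - mean_total * (k - mean_total) / (k * var_total))


# Hypothetical responses: one row per examinee, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(kr20(responses), 3), round(kr21(responses), 3))
```

KR-21 needs only the number of items and the mean and variance of total
scores, which is why it is often reported when item-level data are not
available.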

There is a significant problem in the use of multiple internal-consistency
reliability estimates. Informal logic and critical thinking are probably
heterogeneous notions, but multiple internal-consistency reliability gives
higher ratings to homogeneous tests. Hence in building tests to produce high
multiple internal-consistency reliability there is the tendency to eliminate
items that do not correlate highly with the rest, even though such items
may be very good indicators of some feature that is not highly correlated with
the other features of these heterogeneous notions. Such items tend to be
eliminated by test makers interested in making the reported numbers look good,
and the reason that such items are eliminated is simply that they are non-
conformists.
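
The pressure to drop nonconforming items can be seen in a small numerical
sketch. The data are invented: six items that hang together, plus a seventh
item whose responses correlate weakly (here slightly negatively) with the
rest, standing in for an indicator of some distinct feature. KR-20 is computed
with divisor-N variance, and the function name is hypothetical.

```python
def kr20(item_scores):
    """KR-20 for 0/1 items, using divisor-N variance of total scores."""
    n = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical responses; the last column is the nonconforming item.
seven_items = [
    [1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
six_items = [row[:6] for row in seven_items]   # same data with item 7 dropped

with_item = kr20(seven_items)
without_item = kr20(six_items)
print(round(with_item, 3), round(without_item, 3))
```

On these numbers the coefficient rises when the seventh item is removed, so a
test maker who looks only at the reported figure has an incentive to discard
the item regardless of what it measures.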




This problem of using multiple internal-consistency as a substitute
for the original notion of reliability (consistency of repeated applications)
is not peculiar to the domain of testing in informal logic and critical
thinking. It permeates most mental testing by highly respected organizations.
Informal logicians are especially suited to guard against the resulting
invitation to equivocation in defense of a test. We urge their sharing of
this insight with others less sensitive to equivocal arguments.

Validity. It is not enough for a test to give consistent scores; it
also must measure what we desire to measure. The process of determining
whether a test measures what it is designed to measure is called test
validation. When sufficient evidence has been accumulated to support the
claim that the test measures a certain variable, the test is said to be a
valid measure of that variable. Actually, this way of speaking is slightly
misleading, although it is often encountered. Cronbach (1971, p. 447) urged
that it is not the test instrument per se that is valid, but the
interpretation of data arising from a particular use of the test. A single
test can be used in many ways (e.g., research, placement, job-screening,
grading, etc.). Some interpretations for a particular use may be valid; other
interpretations of data from the same test used for different purposes may
not be valid.

Although Cronbach's point is an important one, most discussions of tests
are still carried on with references to the "validity of a test." One should
understand such locutions as essentially incomplete expressions. A test's
validity must be understood as its ability to measure or be a sign of one or
more specified things when it is given under particular conditions. These
additional qualifications must be kept in mind if one encounters talk of the
validity of a test instead of talk of the validity of interpretation of test
scores.




Despite the appearance of precision given by the statistical
superstructure of test theory, many of the central concepts in this field are
not at all precise or well-clarified. Validity is one such concept. There
are actually several somewhat loosely-related concepts which come under the
heading "validity". We shall briefly characterize five of the concepts
that one is likely to encounter in discussions of testing.

A test is said to have face validity if it appears to be a valid test.
According to the American Psychological Association's Standards for
Educational and Psychological Tests (1974), face validity is the "mere
appearance of validity" (p. 26). Most test experts feel that face validity is
illegitimate, but we do not see how to get along without it. It appears to be
an essential element of content validity.

To explain their notion of content validity, test theorists introduce
the concept, universe of behaviors. Such a universe consists of a set
(possibly infinite) of behaviors that a student should be able to exhibit
if he or she has grasped certain concepts or mastered certain skills. This
concept, universe of behaviors, is ripe for philosophical inquiry. Although
we shall use it in presenting established theory because it is part of
established theory, we have many reservations about its use.

We often design tests to determine whether our students have learned
what we intended to teach them. For example, after a unit on propositional
logic, we want to be able to determine whether our students learned
something about that topic. We cannot, of course, test for all possible
behaviors we would expect a successful student to be able to exhibit. We are
expected to try to get a representative sample of the universe of behaviors
for which we want to test. How one can (without leaning heavily on face
validity judgments) identify a representative sample of a critical-thinking
universe of behaviors is, unfortunately, unclear. However, a test is said to
have content validity to the extent that we actually did choose a
representative sample. It does seem that a test on propositional logic that
only asked questions of the following form would not have much content
validity:
    Is the following argument valid?
        If x, then y.
        x.
        Therefore y.

Assuming that the only difference among the items is the use of different
single letters in the place of 'x' and 'y', the items do not appear to call
for a representative sample of propositional logic behaviors, in whatever
way we choose to interpret the word "behaviors". But that judgment appears
to be a face validity judgment. If not, then we would have had to have a
way of describing and identifying the things in a (infinite) domain of
propositional logic behaviors and of drawing a random or systematic sample
from that domain. This describing, identifying, and sampling makes no sense
to us, but we invite other philosophers to work on the problem.

The resolution of this difficulty is important since the clarification
of the concepts of face validity and content validity is essential for the
further development of theories applicable to criterion-referenced testing.
(Such testing is of paramount interest to those who wish to evaluate the extent
to which students have mastered the content of informal logic courses.)

Two further types of validity, distinct from the previous two but related
to each other, are predictive validity and concurrent validity. In many
cases, a tester is interested in estimating the value of a certain variable
from the score on a test. For example, college admissions officers would
like to estimate a candidate's freshman grade point average, given the
candidate's score on an entrance exam, such as the SAT or ACT. The variable
to be estimated is called the criterion. (This kind of criterion should not
be confused with that in criterion-referenced testing. Predictive and
concurrent validity are not usually concerns in criterion-referenced testing.)
A test has predictive validity if knowing a subject's score on the test
enables us accurately to predict the value of the criterion. The difference
between predictive and concurrent validity lies in the temporal relation
between the test score and the criterion. One speaks of concurrent validity
when one is interested in a subject's standing on the criterion at the time
of test administration, while predictive validity is the concern when one
is interested in the subject's standing on the criterion at some future time.

Concern with predictive validity at one time dominated test theory,
although at the present time more attention is being given to construct
validity than was given to it in the past. One is investigating the construct
validity of a test when one attempts to confirm or disconfirm that a test
measures some hypothetical, unobservable psychological construct. Intelligence,
anxiety, and critical thinking ability are examples of such constructs. The
claim that a test measures intelligence, we feel, is a claim about the
construct validity of a test, as is the claim that a test measures critical
thinking ability. (Behaviorists, who reject the idea of construct validity,
would of course not agree.)

The investigation of construct validity can be viewed as the process
of placing the specified construct in the context of some larger theory,
and ascertaining the acceptability of the theory, the role the construct
plays in the theory, and the relationship between the test and the construct.

    Construct validation is the process of marshalling evidence
    in the form of theoretically relevant empirical relations
    to support the inference that an observed response consistency
    has a particular meaning. (Messick, 1975, p. 55)

Construct validity is viewed by some as a broad concept that encompasses or
subsumes all other types of validity.

The model employed in construct validation is the neo-positivist theory
of confirmation as set forth by Carl Hempel (1965, 1966). Statements about
the construct in question are located in a hypothetico-deductive system.
Predictions are deduced from the system and either confirmed or disconfirmed.
The construct validity of the approach to interpretation of test scores is
supported to the extent that the predictions are confirmed and to the extent
that the predictions depend upon the relationships among the test, the
construct, and the other elements of the system (which are also constructs).
Philosophers are well-acquainted with the extensive and powerful criticisms of
this model, but such criticisms have as yet had little effect on the use of
hypothetico-deductive models in test theory. (This is not to say that test
theorists are completely unaware of the problems involved, cf. Cronbach, 1971.)
A standard philosophical problem connected with construct validity is the
nature of the constructs being investigated. For example, what (if anything)
is being referred to by the term "critical thinking ability"?

Judging tests. How should one use the concepts discussed above when
judging an available test? That depends to a great extent on the purpose
for which one is using the test. Some remarks about general cases, however,
can provide some guidance. One must take care when using the information
provided about tests, since, as we have to some extent indicated, there are
problems involved with the theory behind tests and their interpretation.

One very general problem involved in making judgments for
criterion-referenced testing is the source of the vocabulary which is used to
talk about tests. Classical test theory was developed to cover norm-referenced
testing. Some of the terms which are appropriate to use when discussing
norm-referenced testing are not clearly applicable to criterion-referenced
testing. There is as yet no theory of and vocabulary for
criterion-referenced testing comparable to the theory and vocabulary which
have been developed for norm-referenced testing, although progress has been
made in this area during the last decade. (See Hambleton, Swaminathan, Algina
& Coulson, 1978.) Nevertheless, some test theory concepts seem applicable
to both types of testing.

Reliability as a basis for judgment. High reliability seems to be
desirable for any test although, as indicated earlier, the commonly-used
internal-consistency formulas for estimating reliability are misleading
for nonhomogeneous tests. There are no firm requirements for reliability
coefficients, although some have been suggested. A widely-quoted set of
minimums was set forth by Kelley (1927):

    a) To evaluate level of group accomplishment                 .50

    b) To evaluate differences in level of group
       accomplishment in two or more performances                .90

    c) To evaluate level of individual accomplishment            .94

    d) To evaluate differences in level of individual
       accomplishment in two or more performances                .98

These figures of course depend on the assumptions Kelley made about required
fineness of discrimination and the acceptable chances of going wrong. They
also partly explain why many professional testing people are so devoted to
obtaining high reliabilities: Here we have some "objective" standards, and
there are test development procedures that generally will enable one to
meet these standards, at a cost. The cost might be 1) excessive testing
time, 2) excessive demands on the time of experts, 3) triviality of items,
and 4) neglect of important features of the trait(s) for which one is
testing. (Remember the pressure for homogeneity of items resulting from the
use of internal-consistency formulas for reliability estimation.) To the
extent that we accept these last two costs, a frequent occurrence, we get
a reliable but invalid test.

Heavy reliance on reliability is also related to the fact that early
test developers were often interested in predicting the standing of a
subject on some criterion. They were concerned with predictive validity.
According to Lord and Novick, reliability can be viewed as predictive
validity with respect to a parallel test (1968, p. 63). Consequently,
reliability was seen as part of the only concept of validity thought to be
important, criterion-related validity.

One way to increase multiple internal-consistency reliability is to secure
item homogeneity. Another, according to classical test theory, is simply to
increase the number of items on the test. To illustrate this phenomenon,
consider the Cornell Critical Thinking Test, Level X and the Cornell Critical
Thinking Test, Level Z. Depending on the group from which the data were
collected, the estimated reliabilities of the tests range from .77 to .87 for
Level X and from .55 to .77 for Level Z. However, Level Z is only 52 items
long, as compared with 71 items on Level X. Using the Spearman-Brown formula,
one can show that if Level Z were as long as Level X, its reliability estimates
would range from .62 to .82, which is closer to the range for Level X.
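
The Level Z projection can be checked directly. The sketch below applies the
general Spearman-Brown formula, r' = kr / (1 + (k - 1)r), where r is the
reliability of the existing test and k is the factor by which its length is
changed. The function name is ours, and the only inputs are the item counts
and reliability estimates quoted above.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Level Z has 52 items; Level X has 71.
factor = 71 / 52

projected_low = spearman_brown(0.55, factor)
projected_high = spearman_brown(0.77, factor)

print(f"{projected_low:.3f} to {projected_high:.3f}")
```

The projected values come to roughly .625 and .82, in line with the range
reported above once rounding is allowed for.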




Most currently-used procedures for estimating reliability, even
including split-half methods, do not capture the full-blooded notion of
reliability. They are based on one test administration. Consequently
instability of measurement over repeated administrations is not taken
into account. (For a more detailed explanation of this problem, see
Cureton, 1965. Also see section F of the American Psychological Association's
Standards for Educational and Psychological Tests (1974).)

The concept of reliability must be cautiously applied to
criterion-referenced testing. Some techniques which can be used to increase
the reliability for norm-referenced testing are not appropriate for
criterion-referenced testing. For example, when revising a test, the
reliability can be increased if one retains items with a "difficulty index" of
about .5, meaning that the proportion of examinees obtaining the correct answer
is .5. (This index is misleadingly named. It might better be called the
"ease index", as suggested by Ahmann & Glock, 1958.) If instruction has
been effective, the difficulty (read "ease") index should be high for items
on a test for criterion-referenced testing, meaning that a high proportion
of students should answer the items correctly. A test maker who aims for
items with a .5 difficulty index will then be forced to construct overly
difficult, recondite, or nit-picking questions.
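
One common textbook way to see why items near .5 are favored is through item
variance. The sketch below is an illustration only, assuming right/wrong
scoring: an item answered correctly by a proportion p of examinees has score
variance p(1 - p), which is largest at p = .5. The link from item variance
to spread of total scores is the usual classical-test-theory account and is
not spelled out in the text above.

```python
def item_variance(p):
    """Variance of a right/wrong item answered correctly by proportion p."""
    return p * (1 - p)


difficulty_levels = (0.1, 0.3, 0.5, 0.7, 0.9)
variances = {p: item_variance(p) for p in difficulty_levels}

# The difficulty ("ease") level at which an item spreads examinees out most.
most_spreading = max(variances, key=variances.get)

for p, v in variances.items():
    print(f"ease index {p:.1f}: variance {v:.2f}")
```

An item that 90 percent of well-taught students answer correctly contributes
little variance, which is why a reliability-driven revision tends to discard
exactly the items a criterion-referenced use would want to keep.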

In summary, there are a number of traps facing someone pursuing high 
reliabilities, and in accepting the judgments of others who pursue high 
reliabilities. 

Validity as a basis for judgment. In almost all cases of interest to
the teacher or researcher in informal logic, one must also ask whether a
test is valid. As a first step in judging the validity of a test, one should
examine the description of the test that should be in the manual to see
whether the test comes close to what one seeks. Then, if it appears
worthwhile to go on, one should scrutinize the items very carefully. The best
way to do this is to take the test under the prescribed conditions, and check
one's answers against the key, seeking for explanation and resolution of any
discrepancies. After going through this process, you will have a fairly good
idea about the extent to which the test does what you want it to do. A
judgment based on such an inspection would be a judgment about the so-called
"face validity" of a test. Going through these steps makes good sense, even
though face validity is a disreputable notion in the eyes of many test
theorists, making this low regard somewhat puzzling.

Judging a test for its content validity, as defined above, requires
that one adopt and employ the concept, universe of behaviors. As indicated
earlier, we shall provisionally do so for the purposes of applying this
approach.

Judgments about the content validity of a test should be aided by the
examination of the test rationale that should appear in the test manual. This
rationale should somehow help one identify all the members of the universe of
behaviors, so that one can then decide whether the test items call for a
representative sample from it. How one actually does all this we do not know.
In one sort of actual practice it appears that content validity is established
by making the topics in the rationale quite specific and, if possible, by
transforming them into types of behavior to be exhibited in types of situations
(rather than into specific items of behavior*). This list of types of
behaviors is called a table of specifications. Then face validity judgments
are made (though they are not called face validity judgments) about the
item-produced behaviors' representativeness of the types of behavior in the
table of specifications.

*This distinction is between behaviors' being dispositions and their being
performances.

A second procedure for establishing content validity in actual practice
is to gather a large number of items that an expert judges (another face
validity judgment) to call for behaviors that are representative of types
of behavior desired. Then a random sample of some sort (or a systematic
sample) is drawn from the item pool, and the test consists of this sample.

Note that both of these procedures for establishing content validity do
in fact lean heavily on face validity judgments. Content validity, in the
only ways we can conceive of its pursuit, consists of organized systematic
ways of utilizing the face validity judgments of experts. Both of the
content-validity procedures we have outlined can be followed by someone
building a test of logical competence, and can be evaluated for their care
and quality by consumers of such tests.

These processes of judging pure face validity and content validity are
applicable to any critical thinking test, whether for norm-referenced testing
or criterion-referenced testing, and whether the test was originally
constructed for norm-referenced or criterion-referenced purposes.

A problem that has still not received much attention in the literature
is the problem of making judgments about desirable levels of performance
for criterion-referenced testing. What level of performance should be
considered evidence of mastery? This is a difficult question to which
developing theory does not yet have an answer, even though the test user
often seeks an answer to this question. (See Popham, 1971 for a sympathetic
discussion of problems in the theory of criterion-referenced testing.)

Establishing the construct validity of tests is a difficult task, in
part because construct validity in itself is not a crystal-clear notion.
Much more attention has been given to construct validity in the past
several years than was given to it in the early days of the development of
test theory. But problems still remain.

As described above, the process of making a case for the construct
validity of a test consists of showing how the construct fits into some
larger theory. One way to do this is to show how scores on the test in
question are related to other variables. So, for example, one would expect
that critical thinking ability, since it involves judgments about statements,
would be moderately related to reading ability. Many test manuals offer
lists of correlations of the test in question with other variables. But
such a list by itself does not establish the construct validity of a test.
One must show how the correlations would be expected to follow from a theory
in which the construct in question is embedded.

One important place to look for evidence of construct validity is in
the relation between a test and other closely related measures. For example,
one would expect a high correlation between tests which claim to measure
critical thinking ability. Such "convergence of measures" gives some
support to the construct validity of all of the tests involved. Lack of
agreement could mean several things: a poorly constructed test, differing
conceptions of critical thinking, unshared prerequisite familiarity with
the subject matter, etc.

One might also expect the constructs measured by a particular test to be
unrelated to certain other constructs; that is, one should be able to
discriminate between unrelated constructs. Tests measuring unrelated
constructs should be weakly correlated or uncorrelated (see Campbell & Fiske,
1959).
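
The convergent and discriminant pattern described in the last two paragraphs
can be sketched with ordinary product-moment correlations. Everything in the
example is hypothetical: the three score lists are invented, and the variable
names stand for two tests that both claim to measure critical thinking and
one measure of a construct assumed to be unrelated.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Invented scores for the same six examinees on three instruments.
critical_thinking_a = [10, 12, 14, 16, 18, 20]
critical_thinking_b = [11, 12, 15, 15, 19, 20]
unrelated_measure = [5, 9, 4, 8, 6, 7]

convergent_r = pearson_r(critical_thinking_a, critical_thinking_b)
discriminant_r = pearson_r(critical_thinking_a, unrelated_measure)

print(round(convergent_r, 2), round(discriminant_r, 2))
```

A high first coefficient alongside a low second one is the pattern one hopes
to see, but, as noted above, the numbers support construct validity only when
a theory explains why they should come out that way.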




Construct validity arguments for existing critical thinking tests are
either weak (the Cornell tests and the Watson-Glaser test) or nonexistent
(the others). This is a comment about the arguments for construct validity,
not about the construct validity of the tests.

There are certain positions of which one should be aware when reading
discussions about construct validity. One may encounter those who demand
a reductionist operational definition of each construct. The strict
operationalist does not view such a definition as providing a method of
measuring the construct in question, but thinks that each test defines a
different construct. There are many criticisms of this view (for example,
Ennis, 1964), and even neo-positivists such as Hempel (1961) regard such
a position as too rigid, but one frequently encounters this position in
discussions of educational testing. Holders of this position are opponents
of the use of construct validity in test appraisal (for an example, see
Bechtoldt, 1959).

Another position that one occasionally encounters holds that high
correlation implies conceptual identity. That is, if two tests correlate
highly then they measure the same thing. Cronbach proposes a counterexample
to this position (1969): Comprehension of physical laws will correlate
highly with scientific reasoning ability, but this does not mean they are
identical. It may simply be the case that, at present, the best curricula
do a good job on both and the worst do a poor job on both.

Construct validity questions are usually associated with norm-referenced
testing. Some experts in testing feel that one need not consider construct
validity when assessing tests for criterion-referenced testing. However, there
has recently been criticism of this position. The literature in this area is
just now developing and little can actually be reported at present, but it is
an area that merits watching and participation by philosophers as it develops.

Sometimes predictive and concurrent validity will be of use to informal
logicians, and when they are, the goal is high correlations between the test
and the criterion. Generally correlations between scholastic aptitude tests
and levels of later subject matter achievement run about .5 (this is
predictive validity); correlations of tests with other tests that are testing
for the same thing go up to .8 when the tests are fairly similar (this is
concurrent validity when the other tests are administered at roughly the
same time). These numbers might serve as rough guides for what one can expect.
Statistical significance of correlations is generally of little interest for
predictive and concurrent validity, since that standard is too easy to
satisfy with a sample of any reasonable size.
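
The remark about statistical significance can be made concrete. The sketch
below uses the standard t statistic for testing a correlation against zero,
t = r * sqrt(n - 2) / sqrt(1 - r^2). The correlation of .20 and the sample of
200 are hypothetical figures chosen for illustration, and the cutoff of 1.97
is an approximate two-tailed .05 critical value for a sample of about that
size.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical weak validity coefficient from a moderately sized sample.
r, n = 0.20, 200
APPROX_CRITICAL_T = 1.97  # approximate two-tailed .05 cutoff near n = 200

t_value = t_for_correlation(r, n)
is_significant = t_value > APPROX_CRITICAL_T
variance_explained = r ** 2

print(round(t_value, 2), is_significant, round(variance_explained, 2))
```

Under these assumptions a correlation that accounts for only about 4 percent
of the criterion variance passes the significance test comfortably, which is
why the size of the coefficient is the more informative figure.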

The major problem in using predictive and concurrent validity in
evaluating informal logic tests is that of finding a criterion that can
justifiably be assumed to be better than the test in question.

The suggestions given above for assessing tests are not meant to be
exhaustive. Many other considerations enter into the choice of a test
(e.g., time limits, cost, reading level).

Constructing Tests

After examining available informal logic tests, one might conclude
that none are appropriate for the purposes at hand. At this point a natural
move would be to consider the task of constructing one's own test.

Constructing a good multiple-choice test is no easy task. It cannot
be accomplished at one sitting, since, ideally, test construction involves
several distinct time-consuming steps. One might object that the time
and effort involved would not be justified if one just wants to make up
a midterm exam. While we might not ordinarily undertake a grand project
in such a case, nevertheless, attention to the procedures outlined below
can improve the quality of many tests.

As one would expect, the procedures for constructing instruments for
norm-referenced and criterion-referenced testing are somewhat different,
although one instrument might serve both purposes. Each of the following
lists of procedures is an edited and abridged version of a presentation in
Sax, 1974. If you follow these procedures, it is essential to ask frequently
whether what you are doing makes sense. Mechanical rule following is
dangerous, but an easy trap to fall into. If test specialists are employed,
the informal logician must monitor the process closely.

Constructing tests for norm-referenced use. The following steps
describe the procedures one would usually follow in constructing a test
for norm-referenced use.

1. Test rationale and objectives are determined. This serves as a
foundation for the writing of items. It also serves as part of the case
for the face, content, and construct validity of the test. The content of
the test is determined by the nature of the field in some tests and by
the type of objective to be tested for in others. Subject matter experts
should be involved.

2. Next, items are written for the test. More items are written than
will be included in the final form of the test. Sometimes several
preliminary versions are constructed. Although the multiple-choice format is
most often used, other formats are possible (see "Test format" below).




3. The items are administered and the results are analyzed. It is
desirable to give the proposed items to a fairly large and varied sample of
the population for which the test is designed. For some widely used tests,
such as the SAT, tens of thousands of examinees take the trial tests, but
much smaller samples can be used. Most colleges and universities and some
high schools now have computer facilities which give detailed item analyses
for machine-scored tests.

Two standard results of an item analysis are the difficulty index
(described earlier), and the discrimination index. The discrimination
index is an attempt to indicate how well an item distinguishes between two
groups of people otherwise identified. Often these groups are the top and
bottom groups on the total score on the test; if so, then there is a danger
of overemphasizing homogeneity and neglecting important aspects of informal
logic, if any, that do not correlate highly with the ones that dominate the
test. Seeking high discrimination indices based upon total score helps
achieve high internal-consistency reliability estimates, a trap mentioned
earlier. But in any case items with low discrimination indices should be
carefully scrutinized. Often the cause is a problem in the wording of the
item. Poorly worded distractors (supposedly incorrect alternative answers)
that should not be scored "wrong" can often be detected by item analyses
that indicate how groups selecting each distractor performed on the total
test.
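
The two indices can be computed by hand for a small class. The sketch below
is one simple version, not a prescribed procedure: the discrimination index
is taken as the proportion correct in the top-scoring group minus the
proportion correct in the bottom-scoring group, the groups are formed on
total score, and the 27 percent group size is a common convention assumed
here. The response matrix is invented.

```python
def difficulty_index(item_scores):
    """Proportion of examinees answering the item correctly."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(responses, item, fraction=0.27):
    """Upper-group minus lower-group proportion correct on one item.

    Groups are the top and bottom `fraction` of examinees by total score.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower


# Hypothetical right/wrong responses: six examinees, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
]

first_item_difficulty = difficulty_index([row[0] for row in responses])
first_item_discrimination = discrimination_index(responses, 0)
last_item_discrimination = discrimination_index(responses, 3)

print(first_item_difficulty, first_item_discrimination, last_item_discrimination)
```

In this invented case the last item has a difficulty index of .5 but a
discrimination index of zero against total score. That would flag it for
scrutiny, yet the cause could be that it measures an aspect of competence the
other items ignore, which is the danger described above.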

For norm-referenced testing, difficulty indices of .5 are often sought
because that is a good way to spread people out on a continuum. Dangers of
using this standard for criterion-referenced testing were mentioned earlier,
and they apply to some extent to norm-referenced testing as well. We might
succeed (by following this procedure) in spreading people out on a continuum,
but the continuum might, as a consequence, be of little interest.

In any case, the results of item analyses, one should remember, apply
at best to groups similar to the group that took the test. They might not
apply at all to a different group; a danger here is to do an item analysis
using a group that has received no instruction in informal logic, and then
to use the test on a group that has had considerable instruction in informal
logic. Opportunities for distortion abound. Reliability estimates should
be computed, and validity evidence should be considered.

4. The final test form is constructed. Factors such as time limits
for test taking influence the number of items included in a form.

5. The final form is administered to another large and varied group
of examinees and normative data are generated for use in subsequent
administrations. Widely used tests are continually being revised and norms are
frequently updated.

At this point, one would have a test that is norm-referenced, but not
necessarily a good test. If items are chosen with reliability in mind after
step 3, the reliability of the test is likely to be high. Even if evidence
regarding face, content, and concurrent validity is present, predictive
and construct validity would still need to be determined. It should be
apparent that the construction of a good test for norm-referenced purposes
could take several years.

Constructing tests for criterion-referenced testing. In constructing such
tests, one follows procedures similar to but not identical with norm-referenced
procedures. One should also keep in mind that these tests do not have the
backing theory which norm-referenced tests have.






1. A general test rationale is prepared.

2. The universe of behaviors to be covered by the test is specified.
These specifications indicate what a student who has mastered the universe
should be able to do, although it is not clear whether the "behaviors" are
to be dispositions or performances.

3. Test items are written which conform to the specifications of
step 2. How to do this is not clear, because in most content areas, the
nature of the relationship between the items and the universe of behaviors
is not clear. Be that as it may, one next makes (ideally) a random selection
from all such possible test items, but this is usually not possible since most
universes subsume an infinite number of items. Instead, one might try to
assure that the sample of items selected is representative by comparing the
items with the universe specifications (a face validity judgment).

4. If one has to choose from among the items selected in step 3 (to
adjust the test for proper time length, for example), those items that
discriminate most clearly between those who have had instruction and those
who have not are usually preferred, other things being equal.

5. Standards of competence are determined. There is controversy over
whether this step can or should be taken. Although many criterion-referenced
tests have cut-off scores, Glass (1978) argues that procedures used in
determining cut-off points are indefensible.

6. The test is administered under conditions that conform to the
universe specifications (i.e., if the universe of behaviors deals with
written criticism of written arguments, an oral test would not be appropriate).

7. Student performance is assessed by comparing test results with
the specified standards of competence. One checks to see whether the ratings
of the students make sense.
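
The preference stated in step 4 can be sketched as a simple comparison of
proportions. This is only one way the idea might be put into practice: the
index used here is the proportion correct among instructed students minus the
proportion correct among uninstructed students, and the item names and
responses are invented for illustration.

```python
def instruction_sensitivity(instructed, uninstructed):
    """Proportion correct among instructed minus among uninstructed examinees."""
    p_instructed = sum(instructed) / len(instructed)
    p_uninstructed = sum(uninstructed) / len(uninstructed)
    return p_instructed - p_uninstructed


# Hypothetical right/wrong results: (instructed group, uninstructed group).
item_results = {
    "item_a": ([1, 1, 1, 1, 0], [0, 1, 0, 0, 0]),
    "item_b": ([1, 1, 1, 1, 1], [1, 1, 1, 1, 0]),
}

sensitivity = {
    name: instruction_sensitivity(inst, uninst)
    for name, (inst, uninst) in item_results.items()
}

# Items listed from most to least sensitive to instruction.
preferred_order = sorted(sensitivity, key=sensitivity.get, reverse=True)
print(preferred_order)
```

Here the first item separates the two groups more sharply than the second,
so it would be kept first if the test had to be shortened, other things
being equal.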




As with norm-referenced testing, these procedures lend themselves to
continuing test revision and improvement over time. Items that discriminate
most clearly between those who have and have not mastered the material are
retained and other items are scrutinized for deficiencies. Various forms
of a test can be developed by taking different samples from the domain of
test items conforming to the universe specifications.

The controversy over step 5 points to a popular misconception about the
nature of criterion-referenced testing. Some people believe that a test for
this purpose is essentially one which classifies an examinee as competent
or incompetent with respect to some skill. This is not widely viewed by
test theorists as a necessary characteristic of such a test, although some
theorists view it as highly desirable for practical applications. What is
necessary is that the score be directly interpretable in terms of behaviors
or performances. Deciding what level of performance constitutes competence
is an extra step.

At least some of the procedures outlined here can be helpful even
where a teacher simply wants a mid-term exam that will only be used once.
At least stating the rationale and specifying crucial behaviors (either
dispositions or performances) can be helpful in thinking through the test
specifications.

Test format. An informal logic teacher interested in constructing an
achievement test is faced with the problem of deciding what type of item or
items to construct, e.g., multiple-choice or essay. There is an amazing
variety of item forms used in tests, but for discussion purposes, we will
classify them into three main types: multiple-choice, short-answer, and
essay. These types are distinguished by the latitude a student has in
constructing an answer. Multiple-choice tests allow a student to pick only
from specified possible answers. There are many kinds of multiple-choice forms
available (true-false, matching, multiple-response, etc.) and many ingenious
variations have been invented to measure a wide range of objectives (see
Anderson, 1972, and Wesman, 1971). The short-answer item, such as a
sentence-completion item or identification question, allows students more
freedom in that they must supply the answer themselves. Students are
limited to some extent by the space allowed for the answer and the necessarily
limited nature of the question asked. The essay or open-ended answer allows
students a great deal of freedom in choosing what they believe to be a good
answer.

The problem of the type of item to employ is a thorny one. Test theorists
have traditionally favored multiple-choice items, the type for which the
concept, universe of behaviors, is best adapted. Such items are easily and
inexpensively scored and are not susceptible to errors of measurement caused
by inter-grader disagreement. (But comparable errors slip in through the
writing of the item and the directions, which always leave room for differing
interpretations by examinees.) The data produced by such tests can be
analyzed by the sophisticated statistical techniques available to test
theorists. On the other hand, students might recognize a multiple-choice
alternative as correct when they would not have been able to recall the answer
had they been asked a short-answer question.

Essay questions require even more effort and ability on the examinee's
part since the structure and content of the answer must be supplied by the
examinee. In many cases, informal logicians will want to assess the type
of knowledge that essay questions seem best suited to assess. Unfortunately,
the concept, universe of behaviors, is especially problematic with essay
questions.




Since the type of item one chooses to construct depends on the type of
knowledge one is trying to assess, it could be useful to have some
classification scheme for types of objectives and "behaviors". Benjamin Bloom
and his associates have developed a popular scheme for classifying cognitive
educational objectives (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956).
The hierarchical list of terms developed by Bloom, et al., is as follows:
knowledge, comprehension, application, analysis, synthesis, and evaluation.
Although widely used in the literature on education, this list embodies a
host of philosophical and other problems. They are in the list itself and
its application to testing. The simple problem of classifying an objective
(say, "ability to identify an unstated assumption") gives one doubts about
this list. Bloom, et al., actually classify this objective under 4.10,
"Analysis of Elements", but it is not clear why they place it there instead
of somewhere else.

Although the traditional test-theory view has been that multiple-choice
items are to be preferred whenever possible, some writers have recently
proposed that, for epistemological reasons, multiple-choice items are the
least desirable type. Hugh Petrie (in press) has argued that we should
view a test as the introduction of a disturbance that the examinee will
correct if the desired achievement has been attained. By limiting the
possible responses to a test item (the disturbance), we limit possible
novel responses that would also counteract the disturbance. New theories
proposed by Petrie and others will no doubt have an influence on the future
of testing and evaluation.




Using Tests and Other Techniques in Evaluating Informal
Logic Students, Courses, and Curricula

Even the most well-constructed test is not worth much if it is not
used properly. A test may give us some information about the performance
level of our students or the effect on test scores of different approaches
to teaching informal logic. But we do not give tests just to obtain such
information. We use the information to make judgments about student
achievement or the merit of innovative teaching methods. When we do this we are
engaged in the process of evaluation.

There are many different things that we evaluate in education. They
cannot all be evaluated the same way. Even in any one particular area,
there is not universal agreement about how evaluations should be carried out.
Nevertheless, we can offer some advice on the use of tests and other techniques
for the purpose of evaluation.

Testing and Evaluation

Tests are certainly the most widely used instruments in evaluation studies.
Although there are many legitimate uses of tests, they can also be misused. A
carefully prepared plan for an evaluation study can help guard against the
misuse of tests. We shall discuss the use of tests to assess student
performance and the employment of tests in various experimental designs. Tests can
also be used for other purposes (e.g., placement), but the following topics
will probably be of greater interest to those interested in teaching and
research in informal logic.

The use of criterion-referenced testing in evaluation. Criterion-referenced
testing is the assessing of a student's mastery. We cannot
endorse the current attempt to put this purpose in terms of a universe of
behaviors, but feel that no matter how we conceptualize the basis,
criterion-referenced testing of some sort is useful.




One may use the test results to assign grades, or to determine which
students should advance to the next unit of study and which should remain
behind for more work. In either case, one must face the problem of specifying
what level of performance indicates mastery of the material studied (or what
levels of performance correspond to a certain degree of mastery). Experts
in this area of test theory have no help to offer us on this problem, but it
is essential that the person who does decide be thoroughly familiar with the
test rationale, its items, and the subject matter of informal logic.

The use of norm-referenced testing in evaluation. If one is interested
in comparing the relative standing of groups (e.g., informal logic class
vs. traditional class) on some variable (e.g., critical thinking ability),
then one employs norm-referenced testing. In selecting a test one must
determine whether the face or content validity and construct validity of
a test under consideration for use as a measuring instrument are appropriate
for one's own purpose. It is here that a list of written course objectives
would be helpful, for the test specifications for a particular informal
logic test might not match the course objectives. A test may, however,
measure some things considered important in the course specification and
could therefore serve as a partial measure of the constructs under
investigation. Unfortunately, people all too often rely exclusively on the title
of a test for information about what the test measures. There is no
substitute for a careful, critical examination of a test and its manual.

Experimental design. Just as important as choosing a measuring
instrument is the choice of an experimental design. Even the best
instruments cannot produce useable data unless a proper testing schedule is
followed. We will briefly consider some of the more popular designs. The
reader is encouraged to examine more thorough treatments of this topic,
such as Campbell & Stanley (1963), Winer (1962), or Wiersma (1975).

There is one "design" which most experts do not consider a design at
all. In this "design", the results of a pretest and a posttest on an
experimental group are compared. (A pretest is a test administered before
a treatment. A posttest is a test administered after a treatment. A
treatment is any deliberately-introduced change in the environment of the
group under investigation. For example, instruction in informal logic
would be a treatment.) The problem here is that even if the posttest scores
are higher than the pretest scores, one has no reason to attribute this
difference to the treatment. Any number of other factors (e.g., maturation,
familiarity with the subject-matter induced by the pretest, etc.) could be
responsible for the higher scores.

In order to draw meaningful conclusions from the preceding experiment,
we also need a control group with which to compare the experimental group.
By "control group" we mean a group that is supposed to have the same
characteristics as the experimental group except that it does not receive the
treatment. The preferred method for obtaining equivalent groups is to choose
both groups randomly from the population under study. (There is some
disagreement among theorists about whether randomly selected groups are equivalent
by definition or whether they are only highly likely to be equivalent. The
former position seems to be the commonly-accepted view. Thus this concept of
equivalent groups employed by test theorists and statisticians differs from
the ordinary concept.)

The simplest type of experimental design is the posttest-only control-group
design in which a posttest is administered to the control group and
experimental group. One then looks for a significant difference between
the mean scores of the two groups. This design is simple to set up and is
not as widely used as it could be. On the negative side, the statistical
tests involved are not as powerful as those used in some other designs,
and there is often a lingering suspicion that the groups were not equivalent
at the beginning, despite the random-selection process.
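
A minimal sketch of the posttest-only control-group design follows. It
assigns subjects to two groups at random and then checks the difference
between posttest means with a permutation test. The permutation test is a
stand-in chosen because it needs no statistical tables; it is not a procedure
named in this report. All function names, the number of permutations, and the
sample scores are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def randomly_assign(subjects: Sequence, rng: random.Random) -> Tuple[List, List]:
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(experimental: Sequence[float],
                        control: Sequence[float],
                        n_permutations: int = 2000,
                        seed: int = 0) -> float:
    """Estimate how often a random relabeling of the scores yields a
    mean difference at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(experimental) - mean(control))
    pooled = list(experimental) + list(control)
    n_exp = len(experimental)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_exp]) - mean(pooled[n_exp:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical study: 16 students, identified by number, split at random.
treated_ids, untreated_ids = randomly_assign(range(16), random.Random(1))

# Hypothetical posttest scores for the two groups.
experimental_scores = [20, 21, 22, 23, 24, 25, 26, 27]
control_scores = [10, 11, 12, 13, 14, 15, 16, 17]
p = permutation_p_value(experimental_scores, control_scores)
print(p < 0.05)
```

A small estimated probability says only that chance relabeling rarely
produces so large a difference; it does not settle whether the groups were
equivalent at the outset.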

The most popular true experimental design is probably the pretest-posttest
control-group design. In this set-up both the control and experimental
groups are administered a pretest and a posttest. One often-used
strategy for analyzing the results is to compute gain scores (the difference
between pre- and posttest scores) and to test the difference between mean
gain scores for each group for statistical significance. This approach is
challenged by some experts, however (see Cronbach & Furby, 1970). Analysis
of covariance is now one of the recommended procedures for analyses of data
from this design. Roughly speaking, analysis of covariance attempts to compare
posttest scores while statistically holding pretest scores (and perhaps other
variables) constant. This design gives one a check on group equivalence
through a comparison of pretest scores. However, it requires twice as many
test administrations as the posttest-only control-group design and there is
a problem with attempting to generalize to the unpretested population.
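
The two analysis strategies just described can be contrasted in a short
sketch. It computes the difference between mean gain scores and, separately,
a covariance-adjusted difference between posttest means that uses the pooled
within-group slope of posttest on pretest. The sketch gives point estimates
only (no significance test), assumes a common slope in both groups, and uses
hypothetical function names and invented scores.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Group = Tuple[Sequence[float], Sequence[float]]  # (pretest scores, posttest scores)


def mean_gain(pre: Sequence[float], post: Sequence[float]) -> float:
    """Average of posttest minus pretest across paired scores."""
    return mean(after - before for before, after in zip(pre, post))


def pooled_within_slope(groups: List[Group]) -> float:
    """Slope of posttest on pretest, pooled over within-group deviations."""
    cross = 0.0
    squares = 0.0
    for pre, post in groups:
        pre_mean, post_mean = mean(pre), mean(post)
        cross += sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
        squares += sum((x - pre_mean) ** 2 for x in pre)
    return cross / squares


def adjusted_posttest_difference(experimental: Group, control: Group) -> float:
    """Posttest mean difference after removing the part predicted by
    the difference in pretest means."""
    slope = pooled_within_slope([experimental, control])
    pre_e, post_e = experimental
    pre_c, post_c = control
    return (mean(post_e) - mean(post_c)) - slope * (mean(pre_e) - mean(pre_c))


# Invented scores; the control group starts two points higher on the pretest.
experimental = ([10, 12, 14, 16], [16, 17, 18, 19])
control = ([12, 14, 16, 18], [14, 15, 16, 17])

gain_difference = mean_gain(*experimental) - mean_gain(*control)
adjusted_difference = adjusted_posttest_difference(experimental, control)
print(gain_difference, adjusted_difference)
```

With these invented scores the gain-score difference is 4.0 while the
adjusted difference is 3.0. The two strategies disagree because the groups
differ on the pretest and the within-group slope is less than one, which is
one way of seeing why the choice of analysis matters.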

One big stumbling block to the use of true experimental designs is
the difficulty in arranging for random assignment of individuals to groups.
Most institutional settings are not flexible enough to permit randomization,
and, in some cases, there are also ethical and political problems with
this manipulation of subjects. In such cases, which in educational
institutions is most cases, researchers turn to quasi-experimental designs,
which are similar to true experimental designs but differ in a few
important respects.

The most widely used quasi-experimental design is the "nonequivalent"
control-group design. This design resembles the pretest-posttest
control-group design except that subjects are not assigned to groups at random.
The groups are taken as they are found in some institutional setting. This
means that extraneous factors that influence the selection process may turn
out to be responsible for any significant differences which are found. This
possibility must be carefully considered when weighing the evidence collected.
If one has information about relevant characteristics of the subjects, this
can sometimes be taken into account in the statistical treatment of the data
by means of techniques such as analysis of covariance. Since this design is
often the only one available, readers interested in research that will be
conducted under conditions that preclude the use of true experimental designs
should consult more detailed treatments of this and other quasi-experimental
designs (e.g., Campbell & Stanley, 1963, Kerlinger, 1964, or Airasian, 1974).

Regardless of the type of experimental design chosen, one problem that
any researcher must face is generalizing from a sample to a population.
Unfortunately, the populations from which samples are drawn in educational
research efforts are almost never the populations over which it would be
desirable to generalize. For example, one might draw random samples (for
a control group and an experimental group) from all the students taking
informal logic during a particular semester for the purpose of evaluation
of a certain method of teaching logic. The population to which one would
like to generalize is that of all informal logic students, including next
year's group. But the population from which the sample was drawn was much
more restricted. (Each member of a population must have an equal chance of
being selected in a random assignment.) Sometimes arguments are offered
for the typicality of groups chosen in an attempt to generalize over a
larger population (Campbell & Stanley, 1963, discuss this difficulty, calling
it the problem of "external validity").

Most commercially-available tests contain tables of norms with which
one can compare experimental groups or individuals. Such comparisons are
useful for suggesting hypotheses about differences between, e.g., national
norms and locally-collected data. They can also give an individual an idea
of how he or she stands compared to a norm group. The more accurately the
norm groups are described, the more readily one can choose the appropriate
comparison group. One should not, however, view normative data as a
substitute for control-group data collected by the experimenter.

Statistical significance. In educational research, a result that,
given the assumptions, is the sort that could have occurred by chance less
than five (sometimes one) times out of a hundred is generally deemed to be
statistically significant (the five-in-a-hundred criterion is known as the
.05 level of significance). Beware of this approach. With very large groups
statistical significance can be attributed to differences that are for
practical purposes very small. It is therefore good practice to ask about
statistically-significant differences whether they are also practically
significant. This requires that one immerse oneself in the situation and
inquire about the economic and human cost of producing a given difference,
and about whether the difference produced is large enough to be concerned
about.

Other Approaches to Evaluation

Thus far we have discussed tests and their use as evaluation instruments.
The devoting of a large proportion of this paper to tests and test theory
reflects the extent to which this approach to evaluation dominates education
at the present time. Are there any other approaches to evaluation?

The answer to the preceding question depends on the extent to which
the testing model can be extended to cover everything to be evaluated by
educators in general and informal logicians in particular. According to some
test experts, most of the efforts expended in evaluation projects should be
directed toward the construction and perfection of good tests. For them,
evaluation means measurement, and measurement means the use of some instrument
covered by some test-theory model.

Responsive evaluation. For some evaluators, however, testing is not the
whole of evaluation or even the most important part. One group that employs
a somewhat different approach to the evaluation of educational programs and
materials is the Center for Instructional Resources and Curriculum Evaluation
(CIRCE) at the University of Illinois at Urbana-Champaign, directed by
Robert E. Stake. Stake is suspicious of traditional evaluation methods
since they are what he calls "pre-ordinate" (Stake, 1967, 1976, Stake & Hoke,
1976). That is, they depend on prespecified notions of how a successful program
or course must appear and be. By looking only for certain kinds of results
(usually in terms of test scores), traditional evaluators may overlook things
that would be considered just as valuable as the prespecified objectives,
if they were noticed.

CIRCE evaluations tend to include a great deal of narrative or "portrayal"
material gathered by observers. These observers make note of things they
feel are important and judgments of students, teachers, parents, and school
administrators. Rather than evaluating a program or course strictly in
terms of test scores, Stake's "responsive evaluation" tries to employ a more
holistic approach.



Semi-structured evaluation. Although many of Stake's criticisms of
traditional evaluation methods are certainly to be heeded, it is by no means
clear that tests should be abandoned or even demoted in importance. Rather we
feel that evaluators must become more cautious in their interpretations of
test results and must become more flexible in their use of other approaches
to evaluation (e.g., by including the reports of trained classroom observers
in evaluations). The need for flexibility and new evaluation methods is
especially pressing in research in informal logic. The Illinois Rational
Thinking Project is examining several methods for evaluating curriculum
materials in critical thinking. We have found that tests alone do not provide
all the information we would like, although we still consider them an
indispensable part of evaluation. We are beginning to examine other evaluation
methods, some of which are indicated below. Whether these techniques will
prove useful remains to be seen, but we invite others to experiment with them
and hope others interested in informal logic will make additional suggestions.

Many skills that informal logicians wish to teach their students are
not amenable to evaluation by means of traditional tests. For example,
the application of informal logic skills in conversation and in everyday
arguments is an extremely complicated process. By observing human
interactions that are more or less structured, one can begin to get a feel for
students' abilities in this area. On the more structured side, debates
provide a format that might even produce quantitative data if some type of
scoring system is employed. Students must both construct and criticize
arguments in a debate, so this particular activity is one which teachers of
informal logic should consider using in their classes. Scoring procedures
need to be developed by informal logicians, since what they perceive as
good and bad in a debate is different from what the rhetorician
sees as good and bad.

Debates, while useful in the evaluation of instruction, are not very
realistic forums for the application of logical skills. One problem with
them is that they do not allow participants to change their positions when
they hear a good argument from an opponent (see Scriven, 1976). Small
group discussions might provide a more realistic setting for the application
of logical skills in a context likely to be found in everyday life. An
interview situation might also provide a good context in which to evaluate
the ability of students to construct and criticize arguments. Like debates,
discussions and interviews might lend themselves to analysis by means of a
scoring system, especially if the topic is one in which certain lines of
argument could be expected. However, remembering Stake's criticisms of
traditional methods, one should not rely exclusively on a scoring key when
evaluating something as open-ended as a discussion or interview. An
evaluator must be able to spot unforeseen moves that would indicate that students
are employing the skills and concepts that have been taught.

Surveys and questionnaires. While surveys and questionnaires are used
to some extent at the present time, they are not being employed as fully
as they could be. At the primary and secondary level how a course is
perceived by other teachers, parents, administrators, and, especially, the
students themselves can be important factors in the success or failure of
a course. At the college level, how the course is perceived by students
is still a very important factor. Whether students view a course as training
in the rational pursuit of truth or as training in sophistry will certainly
affect our evaluation of the course. Some attempt is now made to analyze
surveys on the basis of traditional test theory models, but it is not at
all clear that that model is appropriate. More investigation is needed in
this area.

Long-term follow-up. Probably the most neglected approach to evaluation
is the long-term follow-up study. While this approach might fit under
traditional testing models, this depends on the type of follow-up performed.
Unfortunately these kinds of studies are rarely done. This is a rather
sorry state of affairs since the effects of most educational programs are
meant to be lasting. However, most programs and courses are evaluated at
the end of the treatment period and follow-up studies are very expensive
and difficult. Lindquist (1951) distinguishes between immediate objectives,
those which end-of-course evaluations measure, and ultimate objectives, the
attainment of which can perhaps only be evaluated at some time long after
the treatment period. We suspect that informal logicians will be especially
concerned with ultimate objectives, since informal logic courses are meant
to help people reason in everyday situations throughout life. Without
long-term follow-up studies, it is difficult to see how one could decide whether
ultimate objectives had been attained. The financial and logistical problems
which are inherent in long-term studies are obvious, but this fact does not
reduce the need for such studies. Rather it counsels us to be well-prepared
(and supported) before venturing such a study.

Summary

We have presented a brief overview of the state of testing and evaluation
as applied to informal logic.

Currently available general tests in this area were described and
criticized. They are the Watson-Glaser Critical Thinking Appraisal, the
Cornell Critical Thinking Test, Level X, the Cornell Critical Thinking Test,
Level Z, and (if one groups them together) the Instructional Objectives
Exchange Indexes.

Two standard, generally-desirable characteristics of tests were explained:
reliability, the tendency of a test to give the same result when given again
in the same circumstances; and validity, the characteristic of measuring
(or appraising) what the test is supposed to measure (or appraise).
Test-retest, parallel form, and internal consistency methods of estimating
reliability were described and criticized, and the danger of using multiple
internal-consistency methods for tests of heterogeneous traits was noted.
Five common approaches to validity were considered: face, content, construct,
predictive, and concurrent. Current test-specialist contempt for face
validity was questioned; the notion of content validity was challenged
because of its intimate relation to the problematic concept, universe of
behaviors; construct validity, the idea that a test is valid to the extent
that its results fit into a good theory, was explored and found vague, but
not uselessly so; and predictive and concurrent validity were deemed to be
generally of little use to informal logicians because of the lack of an
outside criterion to validate informal logic tests.

We distinguished criterion-referenced testing from norm-referenced
testing. In doing so, we suggested a shift in testing-theory vocabulary
from "test" to "testing", the reason being that a test developed for one
purpose could conceivably be used for the other. Criterion-referenced
testing has the purpose of assessing degree of mastery; norm-referenced
testing has the purpose of discriminating between and among students and
groups. Norm-referenced testing theory is well-developed, though there are
problems, including the built-in invitation to develop reliable, invalid
tests. Criterion-referenced testing theory is in its infancy, and in particular
has problems with its lack of guidelines for determining a level that shall be
deemed mastery and with its generally-accompanying concept, universe of
behaviors. It is not clear whether the recommended random sample from the
universe is to be taken of behaviors as dispositions, of behaviors as
performances, or of items.

Some procedures for developing one's own informal logic tests were
suggested, and various types of evaluation instruments (in addition to the
heavily-emphasized multiple-choice tests) were described and recommended.

Experimental designs were considered. We do not recommend the simple
pretest-posttest design unless there is a control group. But even if one
has a control group, experimental theory calls for the random selection of the
subjects for the experimental and control groups from the population about
which we want to draw conclusions. This is impossible if we want to draw
conclusions about next year's classes, for example, so compromises are struck.
One compromise is to draw one's initial conclusions only about the group from
which one did manage to draw a random sample, and then attempt somehow to infer
to the larger group on the basis of its typicality. A second compromise that is
often struck is to pick one's experimental and control groups not at random, but
so that they are as comparable as we can get them, and then to assume that they
are comparable enough, or to use statistical techniques that, it is hoped,
compensate for incomparability (this is called a "quasi-experimental design").
There is no perfect resolution of these problems.

As we proceeded in laying out this introductory treatment of testing and
evaluating in informal logic, we broached a number of philosophical problems
that are embedded in this field. We did not attempt an exhaustive list of
such problems, but did allude to the following: What sense can be made of
random sampling from a universe of behaviors? What is a "behavior"? What
is a true score? What is critical thinking? What is rational thinking?
What is informal logic? What is the relationship between test performance
and mental traits? What is mastery and in general how can mastery be
inferred from test performance? Is it plausible to judge a test to be
valid on the ground that it fits into a well-confirmed theory, as is
recommended by the construct-validity approach? If so, then what rules and
procedures can be followed to make such judgments? What constitutes typicality?
Can one specify guidelines for generalizing beyond a population from which
a random sample was drawn? If so, what are they? Can one specify guidelines
for acceptable alternatives to random sampling? If so, what are they?

We mention these problems partly in order to warn interested informal
logicians that the field of testing and evaluation is not out there all
ready to provide a neat, clean service to us. But we do so also in the hope
that some philosophers will undertake work on these or other
evaluation-related problems with the intention of offering theoretical help in this
area. In view of informal logicians' practical interests in evaluating
informal logic competence, it should be apparent that philosophical work on
these problems would be a socially significant activity. We also feel that
such work is intrinsically interesting and philosophically important.

We also hope that other informal logicians will develop various kinds
of instruments for evaluating informal logic competence. More are needed,
and if we do not do it, someone else will, someone who knows even less about
it than we do.

References 

Ahmann, J. S., & Glock, M. D. Evaluating pupil growth. Boston: Allyn and
Bacon, 1958.

Airasian, P. W. Designing summative evaluation studies at the local level.
In W. J. Popham (Ed.), Evaluation in education. Berkeley, Calif.:
McCutchan, 1974. Discussion of choosing an experimental design under
the constraints imposed by typical institutional settings.

American Psychological Association. Standards for educational and psychological
tests. Washington, D.C.: American Psychological Association, 1974.

Anderson, R. C. How to construct achievement tests to assess comprehension.
Review of Educational Research, 1972, 42, 145-170. Contains practical
suggestions for writing types of items which are useful in
criterion-referenced tests.

Bechtoldt, H. P. Construct validity: A critique. American Psychologist,
1959, 14.

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R.
Taxonomy of educational objectives, handbook I: Cognitive domain.
New York: David McKay, 1956.

Buros, O. K. (Ed.). The seventh mental measurements yearbook (2 vols.).
Highland Park, N.J.: Gryphon Press, 1972. Standard sourcebook for
psychological tests.

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by
the multitrait-multimethod matrix. Psychological Bulletin, 1959,
56, 81-105.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs
for research on teaching. In N. L. Gage (Ed.), Handbook of research on
teaching. Chicago: Rand McNally, 1963. Classic article on experimental
designs for educational research.

Cronbach, L. J. Validation of educational measures. In P. H. DuBois (Ed.),
Proceedings of the 1969 invitational conference on testing problems.
Princeton, N.J.: Educational Testing Service, 1969. Preliminary
version of Cronbach, 1971.

Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational
measurement. Washington, D.C.: American Council on Education, 1971. A seminal
paper on test validity. Some parts are technical, but it is recommended
reading, nonetheless.

Cronbach, L. J., & Furby, L. How we should measure "change" -- or should we?
Psychological Bulletin, 1970, 74, 68-80. Arguments against the use
of gain scores and the pretest-posttest control group experimental
design.

Cureton, E. E. Reliability and validity: basic assumptions and experimental
designs. Educational and Psychological Measurement, 1965, 25, 327-346.




Ennis, R. H. An appraisal of the Watson-Glaser critical thinking appraisal.
Journal of Educational Research, 1958, 52, 155-158. Covers earlier
versions of the Watson-Glaser test (Forms Am and Bm).

Ennis, R. H. Assumption-finding. In B. O. Smith & R. H. Ennis (Eds.), Language
and concepts in education. Chicago: Rand McNally, 1961.

Ennis, R. H. A concept of critical thinking. Harvard Educational Review, 1962,
32, 81-111. With a few minor amendments, this notion of critical
thinking is the basis for the Cornell Critical Thinking Tests.

Ennis, R. H. Operational definitions. American Educational Research Journal,
1964, 1, 183-201.

Ennis, R. H., Gardiner, W. L., Morrow, R., Paulus, D., & Ringel, L. The Cornell
Class Reasoning Test. Urbana, Ill.: Illinois Critical Thinking Project,
1964.

Ennis, R. H., Gardiner, W. L., Morrow, R., Paulus, D., Ringel, L., & Guzzetta, J.
The Cornell Conditional Reasoning Test. Urbana, Ill.: Illinois Critical
Thinking Project, 1964.

Ennis, R. H., & Millman, J. Manual for Cornell Critical Thinking Test, Level X
and Cornell Critical Thinking Test, Level Z. Urbana, Ill.: Illinois
Critical Thinking Project, 1971(a).

Ennis, R. H., & Millman, J. The Cornell Critical Thinking Test, Level X.
Urbana, Ill.: Illinois Critical Thinking Project, 1971(b).

Ennis, R. H., & Millman, J. The Cornell Critical Thinking Test, Level Z.
Urbana, Ill.: Illinois Critical Thinking Project, 1971(c).


Ennis, R. H., & Paulus, D. Critical thinking readiness in grades 1-12 (phase I:
deductive logic in adolescence). Ithaca, N.Y.: Cornell University,
1965. (ERIC Document Reproduction Service No. ED 003 818).


Glaser, R., & Klaus, D. J. Proficiency measurement: Assessing human performance.
In R. M. Gagne (Ed.), Psychological principles in systems development.
New York: Holt, Rinehart, and Winston, 1962.

Glass, G. V. Standards and criteria. Journal of Educational Measurement,
1978, 15, 237-261.


Guilford, J. P., & Hertzka, A. F. Logical Reasoning (test). Orange, Calif.:
Sheridan Psychological Services, 1955.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. Criterion-referenced
testing and measurement: a review of technical issues and
developments. Review of Educational Research, 1978, 48, 1-47. A
review of the state of the art in criterion-referenced testing. Some
technical sections.






Haney, W. V. The uncritical inference test. Wilmette, Ill.: William V.
Haney Associates, 1975.

Hempel, C. G. A logical appraisal of operationism. In P. G. Frank (Ed.),
The validation of scientific theories. New York: Collier, 1961.


Hempel, C. G. Aspects of scientific explanation. New York: Free Press, 1965.


Hempel, C. G. Philosophy of the natural sciences. Englewood Cliffs, N.J.:
Prentice-Hall, 1966.

Instructional Objectives Exchange. Judgment: deductive logic and assumption
recognition, grades 7-12. Los Angeles: Instructional Objectives
Exchange, 1971.

Kelley, T. L. Interpretation of educational measures. Yonkers, N.Y.: World
Book Co., 1927.


Kerlinger, F. N. Foundations of behavioral research. New York: Holt,
Rinehart, and Winston, 1964. In-depth discussions of experimental
designs.

Lindquist, E. F. Some preliminary considerations in objective test
construction. In E. F. Lindquist (Ed.), Educational measurement.
Washington, D.C.: American Council on Education, 1951.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968. The definitive work on
norm-referenced test theory.

Messick, S. The standard problem: meaning and values in measurement and
evaluation. American Psychologist, 1975, 30, 955-966.

Petrie, H. Against objective tests: a note on the epistemology underlying
current testing dogma. In Mark Ozer (Ed.), Toward the more human use
of human beings: Issues in the application of cybernetics to assessment
of children (forthcoming).

Popham, W. J. (Ed.). Criterion-referenced measurement. Englewood Cliffs,
N.J.: Educational Technology Publications, 1971. Good introduction
to criterion-referenced testing.

Sax, G. The use of standardized tests in evaluation. In W. J. Popham (Ed.),
Evaluation in education. Berkeley, Calif.: McCutchan, 1974. Contains
a comparison of criterion-referenced and norm-referenced tests.

Scriven, M. Reasoning. New York: McGraw-Hill, 1976.

Sell, D. E. Evaluation aptitude test manual. Munster, Ind.: Psychometric
Affiliates, 1952.






Smith, R. A. Regaining educational leadership: Critical essays on PBTE/CBTE,
behavioral objectives, and accountability. New York: John Wiley, 1975.

Stake, R. E. The countenance of educational evaluation. Teachers College
Record, 1967, 68, 523-540.

Stake, R. E. To evaluate an arts program. Journal of Aesthetic Education, 1976,
10, 115-133.

Stake, R. E., & Hoke, G. A. Movement and dance in a downstate district. The
National Elementary Principal, 1976, 55.

Stanley, J. C. Reliability. In R. L. Thorndike (Ed.), Educational measurement
(2nd ed.). Washington, D.C.: American Council on Education, 1971. A
thorough treatment of the topic. Somewhat technical.

Stewart, B. Testing for critical thinking: A review of the resources.
Urbana, Ill.: Illinois Critical Thinking Project, 1979.



Strawson, P. F. Introduction to logical theory. London: Methuen & Co.,
1952.

Watson, G., & Glaser, E. M. Manual for Watson-Glaser critical thinking
appraisal. New York: Harcourt, Brace, & World, 1964(a).

Watson, G., & Glaser, E. M. Watson-Glaser critical thinking appraisal, form Ym.
New York: Harcourt, Brace, & World, 1964(b).

Watson, G., & Glaser, E. M. Watson-Glaser critical thinking appraisal, form Zm.
New York: Harcourt, Brace, & World, 1964(c).

Wesman, A. G. Writing the test item. In R. L. Thorndike (Ed.), Educational
measurement. Washington, D.C.: American Council on Education, 1971.

Wiersma, W. Research methods in education (2nd ed.). Itasca, Ill.:
F. E. Peacock, 1975.

Winer, B. J. Statistical principles in experimental design (2nd ed.).
New York: McGraw-Hill, 1971.




Additional Readings 


Freedman, D., Pisani, R., & Purves, R. Statistics. New York: Norton, 1978.
Introductory text.

Gage, N. L. (Ed.), Handbook of research on teaching. Chicago: Rand McNally,
1963.

Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In
R. L. Thorndike (Ed.), Educational measurement. Washington, D.C.:
American Council on Education, 1971. Contains a discussion of some
uses of criterion-referenced tests.

Glass, G., & Stanley, J. C. Statistical methods in education and psychology.
Englewood Cliffs, N.J.: Prentice-Hall, 1970. Widely used text, moderate
level of difficulty.

Millman, J. Criterion-referenced measurement: Current applications. In
W. J. Popham (Ed.), Evaluation in education. Berkeley, Calif.:
McCutchan, 1974. A thorough introduction to criterion-referenced
testing.

Scriven, M. The methodology of evaluation. In R. W. Tyler, R. M. Gagne,
& M. Scriven, Perspectives in curriculum evaluation. Chicago: Rand
McNally, 1967. Scriven here introduces an important distinction
between formative and summative evaluation.






